Objectives:-
To aid data analysis and maintenance, many clustering algorithms have been
proposed to partition a large data source into meaningful clusters. Selecting
a clustering algorithm that actually helps in understanding a large data
source remains a challenging issue.

Functional Specifications:-
The effectiveness of a clustering algorithm may be influenced by many different
factors. However, a single factor such as cluster quality can be used to judge
how effective an algorithm is for a given data source. The project provides a
comparative analysis of four clustering algorithms, namely K-Means, Partitioning
Around Medoids, Minimum Spanning Tree and Nearest Neighbor, applied to a
diabetes dataset. Besides rapidly generating the clusters, the analysis provides
a basis for determining the quality of the clusters generated and helps identify
the algorithm that produces good-quality clusters. Because data characterization
is a summarization of the general characteristics or features of a target class
of data, the characteristics of the diabetes data are also analyzed using
Attribute-Oriented Induction, with the records that tested positive taken as
the target class.
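
As a rough illustration of the comparison described above, the sketch below runs a plain K-Means and scores the result with the within-cluster sum of squared errors. It is only a sketch under stated assumptions: the synopsis does not name a quality measure, so the SSE score is an assumed stand-in; the records are made-up two-attribute values, not the diabetes dataset; the function names `k_means` and `within_cluster_sse` are hypothetical; and Python is used for brevity even though the preferred technologies are Java or .NET.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def k_means(points, k, max_iter=100):
    """Basic K-Means; the first k points serve as initial centroids (an assumption)."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # an empty cluster keeps its centroid
        if new_centroids == centroids:  # converged: no centroid moved
            break
        centroids = new_centroids
    return labels, centroids


def within_cluster_sse(points, labels, centroids):
    """Assumed quality score: sum of squared distances to the assigned centroid."""
    return sum(dist(p, centroids[label]) ** 2 for p, label in zip(points, labels))


# Hypothetical two-attribute records (illustrative values only).
records = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (9.0, 8.0), (8.0, 9.0)]
labels, centroids = k_means(records, k=2)
sse = within_cluster_sse(records, labels, centroids)
print(labels, round(sse, 3))
```

Running each of the four algorithms on the same records and comparing a common score of this kind is one way the "good quality clusters" criterion could be made concrete; a lower SSE indicates tighter clusters.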

User Interface:-
Windows-based user interface with ease of use.

Preferred Technologies:-
Java (Applets, AWT Events and Swing) or C#.NET 2.0 or VB.NET 2.0

About clustering:-
Cluster computing is not a new area of computing. It is, however, evident that
there is a growing interest in its usage in all areas where applications have
traditionally used parallel or distributed computing platforms. The mounting
interest has been fuelled in part by the availability of powerful
microprocessors and high-speed networks as off-the-shelf commodity components,
and in part by the rapidly maturing software components available to support
high-performance and high-availability applications. This rising interest in
clusters led to the formation of an IEEE Computer Society Task Force on Cluster
Computing (TFCC) in early 1999.

A “commodity cluster” is a local computing system comprising a set of
independent computers and a network interconnecting them. A cluster is local in
that all of its component subsystems are supervised within a single
administrative domain, usually residing in a single room and managed as a single
computer system. The constituent computer nodes are commercial-off-the-shelf
(COTS), are capable of full independent operation as is, and are of a type
ordinarily employed individually for standalone mainstream workloads and
applications. The nodes may incorporate a single microprocessor or multiple
microprocessors in a symmetric multiprocessor (SMP) configuration. The
interconnection network employs COTS local area network (LAN) or system area
network (SAN) technology and may be a hierarchy of, or multiple separate,
network structures. A cluster network is dedicated to the integration of the
cluster compute nodes and is separate from the cluster’s external (worldly)
environment.

A cluster may be employed in many modes, including but not limited to: high
capability or sustained performance on a single problem, high capacity or
throughput on a job or process workload, high availability through redundancy
of nodes, or high bandwidth through multiplicity of disks and disk access or I/O
channels.

A “Beowulf-class system” is a cluster with nodes that are personal computers
(PCs) or small symmetric multiprocessors (SMPs) of PCs, integrated by COTS local
area networks (LAN) or system area networks (SAN), and hosting an open source
Unix-like node operating system. A Windows-Beowulf system also exploits low-cost
mass-market PC hardware, but instead of hosting an open source Unix-like
operating system it runs the widely distributed mass-market Microsoft Windows
and Windows NT operating systems.

A “Constellation” differs from a commodity cluster in that the number of
processors in its node SMPs exceeds the number of SMPs comprising the system,
and the integrating network interconnecting the SMP nodes may be of custom
technology and design.

Definitions such as these are useful in that they provide guidelines and help
focus analysis. But they can also be overly constraining in that they
inadvertently rule out some particular system that intuition dictates should be
included in the set. Ultimately, common sense must prevail.
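
The distinctions drawn by these definitions can be written down as a small rule set. The sketch below is only an illustration: the `ClusterSpec` fields, the `classify` function and the operating-system labels are hypothetical names introduced here, and the rules encode a simplified reading of the definitions above, so the result should be treated as a guideline rather than a verdict.

```python
from dataclasses import dataclass


@dataclass
class ClusterSpec:
    """Hypothetical description of a system; every field name is an assumption."""
    node_count: int           # number of nodes (SMPs) comprising the system
    processors_per_node: int  # processors inside each node
    pc_nodes: bool            # nodes are PCs or small SMPs of PCs
    cots_network: bool        # interconnect is COTS LAN or SAN technology
    node_os: str              # "open-unix-like", "windows", or anything else


def classify(spec):
    """Apply the definitions in order; returns a label, not a final judgement."""
    # Constellation: more processors per node than nodes in the system;
    # the network may be custom, so it is not checked here.
    if spec.processors_per_node > spec.node_count:
        return "Constellation"
    # Every remaining category requires a COTS interconnect.
    if not spec.cots_network:
        return "unclassified"
    if spec.pc_nodes and spec.node_os == "open-unix-like":
        return "Beowulf-class system"
    if spec.pc_nodes and spec.node_os == "windows":
        return "Windows-Beowulf system"
    return "commodity cluster"


# Usage with made-up systems.
print(classify(ClusterSpec(16, 2, True, True, "open-unix-like")))
print(classify(ClusterSpec(4, 64, False, False, "other")))
```

A system that falls through to "unclassified" is exactly the kind of case where the definitions are too constraining and judgement has to take over.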